Preamble

This is an IPython Notebook to accompany the paper entitled 'The Architecture of Genome Wide Association Studies' by Melinda Mills and Charles Rahal. It allows us to share the data analysis undertaken in constructing this data-driven review of all GWAS to date, and it can be used to dynamically replicate the analysis going forward. For installation guidelines, please see the readme.md which accompanies this repository. A preliminary point to note: if you wish to run this .ipynb file itself, you may need to override the default settings of some versions of Jupyter Notebook (4.2 to 5.1) by opening it with:

jupyter notebook --NotebookApp.iopub_data_rate_limit=10000000000

or by editing your jupyter_notebook_config.py file. Let's begin by loading our favourite Python modules, some IPython magics, and a number of custom functions written specifically for this project:
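If you prefer the configuration-file route, the equivalent (for the Jupyter Notebook versions noted above; the attribute name has changed in later versions) is a one-line setting in jupyter_notebook_config.py:

```python
# In ~/.jupyter/jupyter_notebook_config.py:
# raise the data rate limit so large cell outputs are not truncated.
c.NotebookApp.iopub_data_rate_limit = 10000000000
```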

In [1]:
import networkx as nx
import os
import pandas as pd
import re
import seaborn as sns
import sys
import numpy as np
import csv
import itertools
import warnings
import gender_guesser.detector as gender
from Bio import Entrez
from IPython.display import HTML, display
from Support.LoadData import (
    EFO_parent_mapper,
    load_gwas_cat,
    load_pubmed_data, 
    make_timely,
    make_clean_CoR,
    download_cat
)
from Support.PubMed import (
    build_collective,
    build_author,
    build_funder,
    build_abstract,
    build_citation
)
from Support.Analysis import (
    simple_facts,
    ancestry_parser,
    make_meanfemale_andranks,
    make_funders,
    mapped_trait_summary
)
from Support.Figures import (
    gwas_growth,
    choropleth_map,
    wordcloud_figure,
    plot_heatmap,
    plot_bubbles,
    boxswarm_plot
)
from Support.Additional import clean_names
from Support.Ancestry import ancestry_cleaner

warnings.filterwarnings('ignore')
%config InlineBackend.figure_format = 'png'
%matplotlib inline
%load_ext autoreload
%autoreload 2

Then, let's dynamically grab the three main curated datasets that we need for our endeavours ('All Associations', 'All Studies', 'All Ancestries') from the EBI GWAS Catalogue website, along with the EFO-to-parent trait mapping file from their FTP site:

In [2]:
output_path = os.path.abspath(os.path.join('__file__',
                                    '../..',
                                    'data',
                                    'Catalogue',
                                    'Raw'))
ebi_download = 'https://www.ebi.ac.uk/gwas/api/search/downloads/'
download_cat(output_path, ebi_download)

Let's link the PUBMEDID variable to the PubMed API and derive a series of datasets from it using the support functions written in PubMed.py. Note: 'collective' mostly corresponds to consortia and study groups.

In [3]:
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
id_list = set(Cat_Studies['PUBMEDID'].astype(str).tolist())
Entrez.email = 'your@email.com'
papers = Entrez.read(Entrez.efetch(db='pubmed', retmode='xml', id=','.join(id_list)))
build_collective(papers)
build_author(papers)
build_funder(papers)
build_abstract(papers)
build_citation(id_list, Entrez.email)
Number of Collectives Found: 1559!
Authors with last names but no forenames: 8 out of 121258
Built a database of Authors from list of PUBMEDID IDs!
Built a database of Funders from list of PUBMEDID IDs!
Built a database of Abstracts from list of PUBMEDID IDs!
Processing for Citations: : 100%|████████████████████████████████████████████████| 3596/3596 [2:27:32<00:00,  2.46s/it]

Introduction

Some simple 'facts'

Let's do some basic descriptive analysis to see what we've got:

In [4]:
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
FunderInfo, AbstractInfo, AuthorMaster = load_pubmed_data()
simple_facts(Cat_Studies, Cat_Ancestry,
             Cat_Ancestry_groupedbyN, Cat_Full,
             AuthorMaster, mostrecentgwascatdate='2018-08-29')
There are currently: 3596 GWAS papers published (in terms of unique PUBMEDIDs)
The first observed GWAS in the catalogue was dated: 2005-03-10
However, only: 10 papers were published in 2005 and 2006 combined
There are currently: 5774 unique Study Accessions
There are currently: 3472 unique Diseases/Traits studied
These correspond to: 2495 unique EBI "Mapped Traits"
The total number of Associations found is currently: 87590
The average number of Associations is currently: 15.2
Mean P-Value for the strongest SNP risk allele is currently: 1.402e-06
The number of associations reaching the 5e-8 threshold is 47104
The journal to feature the most GWAS studies since 2005-03-10 is: Nat Genet
However, in 2017, Nat Commun published the largest number of GWAS papers
So far, in 2018, Nat Genet published the largest number of GWAS papers
Largest Accession to date: 808380.0 people.
This was published in Nat Commun.
The first author was: Helgadottir A.
Total number of SNP-trait associations is 87590.
Total number of journals publishing GWAS is 467.
The study with the largest number of authors has: 559 authors.
The current publication lag in the Catalog is: 74 days
The lag between the date it was most recently published and date of last included GWAS: 15 days

Can the abstracts give us any clues?

Let's make a nice infographic and do some basic NLP on the abstracts from all PubMed IDs. This figure is the header of the readme.md:

In [5]:
abstract_count = os.path.abspath(
                 os.path.join('__file__', '../..', 'data', 'PUBMED',
                              'Pubmed_AbstractCount.csv'))
output_file = os.path.abspath(
              os.path.join('__file__', '../../', 'figures', 'pdf',
                           'helix_wordcloud_1250_5000_black.pdf'))
wordcloud_figure(abstract_count, output_file)

The file Pubmed_AbstractCount.csv in the PUBMED subdirectory of data details a breakdown of the word counts across abstracts. Check the counts for terms such as 'European', 'Asian', etc.
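As a toy illustration of that kind of check (the column names and counts here are invented; the real breakdown lives in data/PUBMED/Pubmed_AbstractCount.csv):

```python
import pandas as pd

# Invented stand-in for Pubmed_AbstractCount.csv; column names are assumed.
abstract_counts = pd.DataFrame({'Word': ['european', 'asian', 'african', 'genome'],
                                'Count': [1250, 430, 180, 5000]})
# Restrict to ancestry-related terms and rank them by frequency.
ancestry_terms = ['european', 'asian', 'african']
subset = abstract_counts[abstract_counts['Word'].isin(ancestry_terms)]
print(subset.sort_values('Count', ascending=False))
```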

The growth in depth and scope of GWAS over time

We can also visualise the ever-increasing sample sizes and popularity of GWAS. The top panel shows the growing number of GWAS conducted over time. We also see increasingly large sample sizes: i.e. the fraction of 'Big N' studies increases. The bottom-left subplot shows that with this growth comes bigger N, highly correlated with the number of significant associations found. The bottom-right subplot shows the number of unique journals publishing GWAS per year and the number of unique diseases or traits studied. Both steadily increase as the appeal of GWAS broadens.

In [6]:
output_file = os.path.abspath(
              os.path.join('__file__', '../../', 'figures', 'svg',
                           'Figure_1.svg'))
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
gwas_growth(output_file, Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN)

Participants

Ancestry

This section of the analysis only uses data contained within the catalogue, but goes slightly deeper than Section 2 above and other related papers in the literature. We use the 'Broad Ancestral Category' field and aggregate from 135 combinations of 17 ancestries into 7 'broader ancestral categories' to calculate a measure comparable to Popejoy and Fullerton. To get a more detailed analysis of polyvocality, we load in some of our supporting functions to clean up the "INITIAL SAMPLE SIZE" and "REPLICATION SAMPLE SIZE" free-text fields in the catalogue and then calculate something analogous to Panofsky and Bliss. The supporting functions are based on regular expressions and data-wrangling techniques which exploit patterns in these free-text fields. By way of a very simple example: “19,546 British ancestry individuals from 6863 families.” will be cleaned into two separate fields, “19,546” and “British”, which can then be used for further downstream analysis. A slightly more complex example: “2,798 European ancestry individuals, 228 French Canadian founder population individuals” will correspond to two entries of 2798 and 228 in the new ‘Cleaned N’ type variable, corresponding to ‘European’ and ‘French Canadian’ in the ‘Cleaned Ancestry’ type variable respectively. Uncomment the appropriate lines of text to get the full lists.
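The cleaning itself is implemented in Support/Ancestry.py; purely as an illustrative, standalone sketch of the kind of regular-expression logic involved (this is not the actual ancestry_cleaner implementation), the two worked examples above could be parsed like so:

```python
import re

def parse_sample_field(text):
    """Illustrative sketch only: extract (N, ancestry) pairs from a
    free-text sample-size description."""
    # A number (optionally with thousands separators), then an ancestry
    # phrase, then a recognised 'individuals' suffix.
    pattern = re.compile(
        r'([\d,]+)\s+([A-Za-z/ ]+?)\s*(?:ancestry\s+individuals|'
        r'founder\s+population\s+individuals|individuals)')
    return [(int(n.replace(',', '')), ancestry.strip())
            for n, ancestry in pattern.findall(text)]

print(parse_sample_field('19,546 British ancestry individuals from 6863 families.'))
# → [(19546, 'British')]
print(parse_sample_field('2,798 European ancestry individuals, '
                         '228 French Canadian founder population individuals'))
# → [(2798, 'European'), (228, 'French Canadian')]
```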

In [7]:
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
Cat_Studies['InitialClean'] = Cat_Studies.apply(
    lambda row: ancestry_cleaner(row, 'INITIAL SAMPLE SIZE'), axis=1)
output_path = os.path.abspath(
              os.path.join('__file__',
                           '../..',
                           'data',
                           'Catalogue',
                           'Synthetic',
                           'new_initial_sample.csv'))
ancestry_parser(output_path, 'InitialClean', Cat_Studies)
Cat_Studies['ReplicationClean'] = Cat_Studies.apply(
    lambda row: ancestry_cleaner(row, 'REPLICATION SAMPLE SIZE'), axis=1)
output_path = os.path.abspath(
              os.path.join('__file__',
                           '../..',
                           'data',
                           'Catalogue',
                           'Synthetic',
                           'new_replication_sample.csv'))
ancestry_parser(output_path, 'ReplicationClean', Cat_Studies)
In [8]:
clean_initial = pd.read_csv(os.path.abspath(
                            os.path.join('__file__', '../..', 'data',
                                         'Catalogue', 'Synthetic',
                                         'new_initial_sample.csv')),
                            encoding='utf-8')
clean_initial_sum = pd.DataFrame(
    clean_initial.groupby(['Cleaned Ancestry']).sum())
clean_initial_sum.rename(
    columns={'Cleaned Ancestry Size': 'Initial Ancestry Sum', }, inplace=True)
clean_initial_count = clean_initial.groupby(['Cleaned Ancestry']).count()
clean_initial_count.rename(
    columns={'Cleaned Ancestry Size': 'Initial Ancestry Count', }, inplace=True)
clean_initial_merged = clean_initial_sum.merge(pd.DataFrame(
    clean_initial_count['Initial Ancestry Count']), how='outer', left_index=True,
    right_index=True)
clean_initial_merged = clean_initial_merged.sort_values(
    by='Initial Ancestry Sum', ascending=False)
holder = ''
for index, row in clean_initial_merged.iterrows():
    holder = holder + \
        str(index) + ' (' + str(row['Initial Ancestry Sum']) + \
        ',' + str(row['Initial Ancestry Count']) + '), '
print('There are a total of ' + str(len(clean_initial_merged)) +
      ' ancestries found in the \'INITIAL SAMPLE SIZE\' column.' +
      '\nThese are: (number of people used in all studies, ' +
      'number of studies included):\n\n' + holder[:-2] + '.\n\n')
There are a total of 208 ancestries found in the 'INITIAL SAMPLE SIZE' column.
These are: (number of people used in all studies, number of studies included):

European (88427472,5809), Japanese (6484432,586), British (4273424,54), Icelandic (2290333,20), East Asian (1633062,208), African American (1600831,712), Finnish (927347,165), African (670551,161), Korean (569706,246), Han Chinese (542392,391), Hispanic (527725,247), South Asian (497157,53), Asian (294167,143), Hispanic/Latino (290293,63), Chinese (281647,171), Indian Asian (181580,44), Latin American (162910,42), African American/Afro Caribbean (161018,36), Indian (132082,54), Hispanic/Latino American (116226,12), African American or Afro Caribbean (102828,10), Hispanic or Latino (83557,4), European and Asian (79845,2), Dutch (75049,20), Taiwanese Han Chinese (72100,14), Filipino (68299,45), Swedish (67990,23), Sardinian (66472,24), European American (63371,37), Northern European (59894,17), Latino (54928,41), Pomak (53847,31), Rucphen (51402,28), Bangladeshi (48462,34), Hispanic/Latin American (48109,7), French (47074,63), Malay (46437,26), Mylopotamos (45756,31), Hispanic or Latin American (45635,12), Mexican (42743,25), German (42145,16), Pakistani (38610,11), European and East Asian (35524,2), European and South Asian (30482,2), Orcadian (30241,27), Mexican American (29291,42), Asian Indian (26798,9), Black (24820,29), Ashkenazi Jewish (23674,34), European and Hispanic (22604,1), European and African American (22593,2), Old Order Amish (21338,40), Italian (20264,12), Val Borbera (19754,15), European and Indian Asian (17967,1), Saudi Arab (17030,8), Singaporean Chinese (14920,24), South East Asian (14708,12), Northern Finnish (14289,3), Hispanic American (13835,16), Northwestern European (13483,2), British and Irish (11375,2), Western European (11043,3), African British (10206,4), Sub Saharan African (9377,25), Brazilian (8761,19), Ugandan (8583,9), African American and African (8298,2), Danish (8247,4), Mexican and Latin American (8214,2), Thai (8143,18), Korkulan (7706,11), Turkish (7638,14), French Canadian (7508,13), Austrian (7407,9), Puerto Rican 
(7096,9), Sorbian (6982,8), Mainland Hispanic/Latino (6734,1), Fruili Venezia Giulia (6634,7), European and  Rucphen (6597,1), Vietnamese (6469,4), Polish (6413,13), Lebanese (6308,6), Norwegian (5704,6), Caribbean Hispanic (5607,4), Caribbean Hispanic/Latino (5458,1), Yoruban (5115,7), Gambian (5038,4), Amish (5035,10), Latino American (4912,4), South Tyrol (4728,5), Latino or African (4658,9), Punjabi Sikh (4619,5), Southern Chinese (4582,3), Indo European (4238,4), Kosraen (4168,2), Taiwanese (3946,10), Hutterite (3776,9), American Indian (3704,12), Carlantino (3667,10), Nigerian (3564,3), West African (3434,5), South Indian (3265,4), Pima Indian (3251,8), Afro Caribbean (3201,4), Samoan (3072,1), African American or European (3054,2), Tibetan (3043,2), Tanzanian (3041,5), African American/African Caribbean (3015,4), Kenyan (2857,2), Native American North American (2835,2), North Indian (2692,4), Sereer (2640,5), Micronesian (2346,1), Celtic (2307,1), Oceania (2229,2), Cilento (2163,3), Indonesian (2041,5), Sirente (2002,7), Middle Eastern (1950,2), Black Hispanic (1940,4), Native Hawaiian (1902,6), Moroccan (1851,9), Japanese American (1697,3), African American and African Caribbean (1612,1), Surinamese (1553,7), Talana (1488,3), African and Asian (1434,8), African and African Arab (1215,4), Spanish (1193,4), Sri Lankan Sinhalese (1154,4), Slavic (1102,1), Southern European (1090,1), Latin/South American (1050,2), Split (986,2), African Caribbean (929,2), Ogliastran (897,1), Ethiopian (895,6), Native American South American (875,2), African Arab (853,4), Arab (836,6), Mongolian (756,1), Martu Australian Aboriginal (752,3), Central European (744,1), South African (733,2), Uyghur (721,1), Tyrolean (696,1), Mixed (684,1), South African Black (637,2), Thai Chinese (618,2), Papua New Guinean (576,2), Bulgarian (564,2), European American and Mexican American (540,1), Vietnamese Korean (518,1), Costa Rican (506,2), Jewish (497,2), Isreali Arab (496,6), Middle 
Eastern/North African (482,4), Taiwanese Han (470,2), Irish (432,2), Malaysian Chinese (371,2), Russian (366,6), Iranian (352,2), Silk Road (348,1), Finnish Saami (347,1), Maghrebian (346,2), Jewish Isreali (331,2), North African (329,7), Plains American Indian (322,1), Fijian Indian (319,2), Hong Kong Chinese (315,1), Norfolk Island (285,2), Greater Middle Eastern (281,2), Asian American (261,8), South African Afrikaner (251,2), Hispanic/Native American (237,1), Tatar (231,2), Southern Indian (229,2), Malawian (226,2), Tunisian (221,2), Portuguese (221,2), Asian or Pacific Islander (207,3), Arab Isreali (189,1), Hispanic and European (180,2), Hispanic/Native (164,1), Bashkir (161,2), Han Chinese American (156,2), Hispanic Caucasian (136,2), Nepalese (99,2), Oriental (90,1), Native American (89,7), Solomon Islander (85,2), Uygur Chinese (83,1), Dai Chinese (58,1), European and Mexican (41,2), Jingpo Chinese (40,1), Romanian (32,1), European/Asian (27,2), Caucasian Eastern Mediterranean (26,2), Native Hawaiian or Pacific (14,2), Hui Chinese (11,1), Cape Verdian (4,2), Native American or Alaska Native (4,2), Isreali (2,2), Curacao (2,2), Dominican Republic (2,2), Afghanistan (2,2).


In [9]:
clean_replication = pd.read_csv(os.path.abspath(
                                os.path.join('__file__',
                                             '../..',
                                             'data',
                                             'Catalogue',
                                             'Synthetic',
                                             'new_replication_sample.csv')),
                                encoding='utf-8')
clean_replication_sum = pd.DataFrame(
    clean_replication.groupby(['Cleaned Ancestry']).sum())
clean_replication_sum.rename(
    columns={'Cleaned Ancestry Size': 'Replication Ancestry Sum', }, inplace=True)
clean_replication_count = clean_replication.groupby(
    ['Cleaned Ancestry']).count()
clean_replication_count.rename(
    columns={'Cleaned Ancestry Size': 'Replication Ancestry Count', }, inplace=True)
clean_replication_merged = clean_replication_sum.merge(
    pd.DataFrame(clean_replication_count['Replication Ancestry Count']),
    how='outer', left_index=True, right_index=True)
clean_replication_merged = clean_replication_merged.sort_values(
    by='Replication Ancestry Sum', ascending=False)
holder = ''
for index, row in clean_replication_merged.iterrows():
    holder = holder + \
        str(index) + ' (' + str(row['Replication Ancestry Sum']) + \
        ',' + str(row['Replication Ancestry Count']) + '), '
print('There are a total of ' + str(len(clean_replication_merged)) +
      ' ancestries found in the \'REPLICATION SAMPLE SIZE\' column.' +
      ' These are (number of people used in all studies, number of' +
      ' studies included):\n\n' + holder[:-2] + '.')
There are a total of 148 ancestries found in the 'REPLICATION SAMPLE SIZE' column. These are (number of people used in all studies, number of studies included):

European (36890259,3031), Asian (3387710,110), East Asian (2131174,240), Icelandic (1181163,17), Japanese (1061251,296), Han Chinese (1026731,258), African American (753265,350), Chinese (427391,185), South Asian (403673,61), Hispanic (397124,146), Korean (306162,146), British (251083,3), African (192176,76), European and Asian (136010,2), Indian Asian (118255,14), African American/Afro Caribbean (104015,48), European and African American (99227,9), European and East Asian (87155,5), Finnish (83580,24), Hispanic/Latino (62435,12), Indian (50713,47), Hispanic/Latino American (43200,6), European and Middle East/North African (29807,2), African America/Afro Caribbean (27661,1), Mexican (27148,17), Sardinian (25831,13), Latino (25299,17), Sub Saharan African (23752,26), European and  Rucphen (22789,1), Malay (22036,27), Brazilian (21631,17), Filipino (21546,23), African American and African (20809,1), Dutch (20367,3), Black and European (19090,1), Mexican American (18703,24), Northern European (16962,3), Ashkenazi Jewish (15801,16), European and Indian Asian (13615,1), African British (13364,3), Asian Indian (12906,7), Indo European (12659,5), African American and Afro Caribbean (11583,3), Southern Chinese (11058,3), Old Order Amish (10887,12), Punjabi Sikh (10261,5), Thai (10032,25), African American or Afro Caribbean (8972,3), Turkish (8855,11), Hispanic/Latin American (8622,5), European American (8614,3), Rucphen (8337,7), Mexican/Latino (8214,2), Saudi Arab (7692,12), Pima Indian (7495,9), American Indian (7039,7), French Canadian (6846,15), Swedish (6366,4), Pakistani (6115,6), German (5569,4), Vietnamese (5490,4), Indonesian (5253,7), Silk Road (5017,11), West African (4780,7), Southern Indian (4600,2), Middle Eastern (4591,5), Danish (4258,2), Black (4111,12), Singaporean (3968,2), North Indian (3956,4), Russian (3950,2), Singaporean Chinese (3580,2), Gambian (3463,2), Dravidian (3260,4), Seychellois (3258,14), She Chinese (3235,1), Iranian (3210,9), 
Northwestern European (2942,2), Hispanic / Latino (2826,1), Talana (2508,3), Amish (2429,4), African American and African Caribbean (2147,1), Arab (2120,13), Samoan (2102,1), European and Ashkenazi Jewish (2035,2), African American and European (1893,1), Polish (1810,9), Costa Rican (1778,3), South and Central American (1769,1), Cilento (1709,2), Hispanic and European (1682,2), North African (1566,10), Polynesian (1536,2), Singaporean Indian (1520,1), Nepalese (1478,6), Caribbean Hispanic (1390,6), Latin American (1386,6), Oceania (1372,5), Southern European (1296,1), Native Hawaiian (1277,2), Hutterite (1255,2), Moroccan (1247,3), Latino American (1242,2), Indigenous Mexican (1178,2), European and Hispanic (1174,4), Jamaican (1089,2), Orcadian (1035,2), Afro Caribbean (962,1), Fruili Venezia Giulia (957,2), Val Borbera (910,1), Uygur Kazakh Chinese (840,2), Asian American (670,2), Spanish (662,2), Sorbian (659,1), Mongolian (542,2), Portuguese (525,2), South African Black (481,2), Mayan (475,2), Bulgarian (450,2), Ethiopian (433,6), Malaysian Chinese (420,2), Taiwanese (400,1), Middle Eastern Arab (352,2), Jewish (344,2), Carlantino (322,1), Afro Caribbean and Sub Saharan African (321,1), Surinamese (284,1), Hispanic and Native American (282,1), French (276,1), Western European (253,1), Malaysian (212,2), Arab Isreali (198,2), Tibetan (161,1), Yoruban (146,2), Uyghur (99,2), Ashkenazi Jewish Isreali (96,2), Hispanic American (85,1), Jewish Isreali (74,2), European/Asian (74,2), African America and European (68,1), Italian (45,2), African and Asian (38,2), Aboriginal Canadian (15,2), Native Hawaiian or Pacific (7,1), Black African (5,1), Native American or Alaska Native (4,1), South African (2,1), Colored African (1,1).

Let's aggregate to create a dataframe for native/indigenous populations using a dictionary-based classifier:
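As a minimal, self-contained sketch of what such a dictionary-based classifier amounts to (toy rows only; the actual lookup table is data/Support/native_classifier_dictionary.csv, whose exact structure is assumed here):

```python
import pandas as pd

# Toy lookup table mapping cleaned ancestry terms to native/indigenous
# populations ('Not Native' marks everything else).
idx = pd.Index(['Native Hawaiian', 'European', 'Pima Indian'],
               name='Cleaned Ancestry')
lookup = pd.DataFrame({'Native Population': ['Native Hawaiians', 'Not Native',
                                             'Pima Indians']}, index=idx)
observed = pd.DataFrame({'Total Ancestry Count': [6, 5809, 8]}, index=idx)
# An index-aligned merge classifies each observed ancestry term.
classified = observed.merge(lookup, left_index=True, right_index=True)
print(classified[classified['Native Population'] != 'Not Native'])
```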

In [10]:
native_dict = pd.read_csv(os.path.abspath(os.path.join('__file__', '../..',
                                          'data', 'Support',
                                          'native_classifier_dictionary.csv')),
                          index_col='Cleaned Ancestry')
native_df = pd.merge(clean_replication_merged, clean_initial_merged,
                     left_index=True, right_index=True, how='outer')
if len(np.setdiff1d(native_df.index.values, native_dict.index.values)) > 0:
    print('There are still unique ancestries which need classifying!' +
          ' Add them to the dictionary: ' +
          ', '.join(np.setdiff1d(native_df.index.values, native_dict.index.values)))
else:
    print('Congratulations! The Native/Indigenous dictionary is fully up to date!')
native_df = native_df.fillna(0)
native_df['Total Ancestry Sum'] = native_df['Replication Ancestry Sum'] + \
                                  native_df['Initial Ancestry Sum']
native_df['Total Ancestry Count'] = native_df['Replication Ancestry Count'] + \
                                    native_df['Initial Ancestry Count']
native_df = pd.merge(native_df, native_dict, left_index=True, right_index=True)
print('We observe ' + str(len(native_df)) + ' unique terms for ancestry/ethnicity in total.')
print('Of these, ' + str(len(native_df[native_df['Native Population']!='Not Native'])) +
      ' relate to native or indigenous ancestries/ethnicities.')
Congratulations! The Native/Indigenous dictionary is fully up to date!
We observe 236 unique terms for ancestry/ethnicity in total.
Of these, 19 relate to native or indigenous ancestries/ethnicities.
In [11]:
print('These relate to ' +
      str(len(native_df[native_df['NativeType']=='Indicated']['Native Population'].unique())) +
      ' specific native/indigenous groups as implied by the free text.')
print('These are: ' +
      ', '.join(native_df[native_df['NativeType']=='Indicated']['Native Population'].unique()))
print('Out of ' + str(int(native_df['Total Ancestry Count'].sum())) +
      ' mentions of ancestry/ethnicity, ' +
      str(int(native_df[native_df['NativeType']=='Indicated']['Total Ancestry Count'].sum())) +
      ' relate to native/indigenous groups as implied by the free text (' +
      str(round(native_df[native_df['NativeType']=='Indicated']['Total Ancestry Count'].sum() /
                native_df['Total Ancestry Count'].sum()*100, 3)) + '%).')
print('Out of ' + str(int(native_df['Total Ancestry Sum'].sum())) +
      ' participants, ' +
      str(int(native_df[native_df['NativeType']=='Indicated']['Total Ancestry Sum'].sum())) +
      ' are of native/indigenous groups as implied by the free text (' +
      str(round(native_df[native_df['NativeType']=='Indicated']['Total Ancestry Sum'].sum() /
                native_df['Total Ancestry Sum'].sum()*100, 3)) + '%).')
print('The most common native/indigenous group implied by the free text is: ' +
      native_df[native_df['NativeType']=='Indicated']['Total Ancestry Sum'].sort_values(ascending=False).index.values[0] +
      ', which contributed ' +
      str(int(native_df[native_df['NativeType']=='Indicated']['Total Ancestry Sum'].sort_values(ascending=False).iloc[0])) +
      ' people to the total sample.')
These relate to 8 specific native/indigenous groups as implied by the free text.
These are: Aboriginal Canadians, Hispanic/Native American, Indigenous Mexicans, Indigenous Australian, Native Americans, Indigenous peoples of South America, Native Americans or Alaska Natives, Native Hawaiians
Out of 16289 mentions of ancestry/ethnicity, 34 relate to native/indigenous groups as implied by the free text (0.209%).
Out of 162033292 participants, 9353 are of native/indigenous groups as implied by the free text (0.006%).
The most common native/indigenous group implied by the free text is: Native Hawaiian, which contributed 3179 people to the total sample.
In [12]:
print('These relate to ' +
      str(len(native_df[native_df['NativeType']!='Not Native']['Native Population'].unique())) +
      ' specific native/indigenous groups implied by the free text or manually classified.')
print('These are: ' +
      ', '.join(native_df[native_df['NativeType']!='Not Native']['Native Population'].unique()))
print('Out of ' + str(int(native_df['Total Ancestry Count'].sum())) +
      ' mentions of ancestry/ethnicity, ' +
      str(int(native_df[native_df['NativeType']!='Not Native']['Total Ancestry Count'].sum())) +
      ' relate to native/indigenous populations implied by the free text or manually classified (' +
      str(round(native_df[native_df['NativeType']!='Not Native']['Total Ancestry Count'].sum() /
                native_df['Total Ancestry Count'].sum()*100, 3)) + '%).')
print('Out of ' + str(int(native_df['Total Ancestry Sum'].sum())) +
      ' participants, ' +
      str(int(native_df[native_df['NativeType']!='Not Native']['Total Ancestry Sum'].sum())) +
      ' are of native/indigenous populations implied by the free text or manually classified (' +
      str(round(native_df[native_df['NativeType']!='Not Native']['Total Ancestry Sum'].sum() /
                native_df['Total Ancestry Sum'].sum()*100, 3)) + '%).')
print('The most common native/indigenous group implied by the free text or manually classified is: ' +
      native_df[native_df['NativeType']!='Not Native']['Total Ancestry Sum'].sort_values(ascending=False).index.values[0] +
      ', which contributed ' +
      str(int(native_df[native_df['NativeType']!='Not Native']['Total Ancestry Sum'].sort_values(ascending=False).iloc[0])) +
      ' people to the total sample.')
These relate to 16 specific native/indigenous groups implied by the free text or manually classified.
These are: Aboriginal Canadians, Native American, Pacific Islander, Bashkirs, Saami, Hispanic/Native American, Indigenous Mexicans, Indigenous Australian, Mayan, Native Americans, Indigenous peoples of South America, Native Americans or Alaska Natives, Native Hawaiians, Pima Indians, Plains Indians, Samoan
Out of 16289 mentions of ancestry/ethnicity, 81 relate to native/indigenous populations implied by the free text or manually classified (0.497%).
Out of 162033292 participants, 37528 are of native/indigenous populations implied by the free text or manually classified (0.023%).
The most common native/indigenous group implied by the free text or manually classified is: Pima Indian, which contributed 10746 people to the total sample.

Let's now do a quick check that our 'Broader' dictionary is up to date:

In [13]:
if len(Cat_Ancestry[Cat_Ancestry['Broader'].isnull()]) > 0:
    print('Perhaps we need to update the dictionary for new terms? Something like these ones:\n' +
         '\n'.join(Cat_Ancestry[Cat_Ancestry['Broader'].isnull()]['BROAD ANCESTRAL'].unique()))
else:
    print('Nice! Looks like our Broad to Broader dictionary is up to date!')
Nice! Looks like our Broad to Broader dictionary is up to date!

Let's put these numbers into tables to get a deeper understanding, first including rows in which at least one ancestry is NR (not recorded), and then dropping all rows in which at least one recorded ancestry is NR:

In [14]:
Broad_Ancestral_Full_initial = pd.DataFrame(
    Cat_Ancestry[Cat_Ancestry.STAGE ==
                 'initial'].groupby(['Broader'])['N'].sum())
Broad_Ancestral_Full_initial.rename(
    columns={'N': 'N (Initial)'}, inplace=True)
Broad_Ancestral_Full_replication = pd.DataFrame(
    Cat_Ancestry[Cat_Ancestry.STAGE ==
                 'replication'].groupby(['Broader'])['N'].sum())
Broad_Ancestral_Full_replication.rename(
    columns={'N': 'N (Replication)'}, inplace=True)
Broad_Ancestral_Full = pd.merge(Broad_Ancestral_Full_initial,
                                Broad_Ancestral_Full_replication,
                                left_index=True, right_index=True)
Broad_Ancestral_Full['Total'] = Broad_Ancestral_Full['N (Initial)'] + \
    Broad_Ancestral_Full['N (Replication)']
Broad_Ancestral_Full['% Discovery'] = (
    Broad_Ancestral_Full['N (Initial)'] /
    Broad_Ancestral_Full['N (Initial)'].sum()) * 100
Broad_Ancestral_Full['% Replication'] = (
    Broad_Ancestral_Full['N (Replication)'] /
    Broad_Ancestral_Full['N (Replication)'].sum()) * 100
Broad_Ancestral_Full['% Total'] = (Broad_Ancestral_Full['Total'] /
                                   Broad_Ancestral_Full['Total'].sum()) * 100
Broad_Ancestral_Full.to_csv(os.path.abspath(
                            os.path.join('__file__',
                                         '../..',
                                         'tables',
                                         'Broad_Ancestral_Full.csv')))
Broad_Ancestral_Full
Out[14]:
N (Initial) N (Replication) Total % Discovery % Replication % Total
Broader
African 366395 144837 511232 0.304245 0.262963 0.291290
African Am./Caribbean 2238062 1015713 3253775 1.858432 1.844105 1.853936
Asian 11109769 9038288 20148057 9.225281 16.409708 11.479960
European 96867242 38423715 135290957 80.436194 69.761215 77.086081
Hispanic/Latin American 1500873 665219 2166092 1.246288 1.207756 1.234196
In Part Not Recorded 7786463 4934147 12720610 6.465689 8.958324 7.247949
Other/Mixed 558627 856988 1415615 0.463870 1.555928 0.806589

Drop the 'in part not recorded' rows (i.e. where the Broad Ancestral Category contains at least one NR):

In [15]:
Broad_Ancestral_NoNR = Broad_Ancestral_Full[['N (Initial)',
                                             'N (Replication)',
                                             'Total']]
Broad_Ancestral_NoNR = Broad_Ancestral_NoNR.drop('In Part Not Recorded')
Broad_Ancestral_NoNR['Total'] = Broad_Ancestral_NoNR['N (Initial)'] + \
    Broad_Ancestral_NoNR['N (Replication)']
Broad_Ancestral_NoNR['% Discovery'] = (
    Broad_Ancestral_NoNR['N (Initial)'] /
    Broad_Ancestral_NoNR['N (Initial)'].sum()) * 100
Broad_Ancestral_NoNR['% Replication'] = (
    Broad_Ancestral_NoNR['N (Replication)'] /
    Broad_Ancestral_NoNR['N (Replication)'].sum()) * 100
Broad_Ancestral_NoNR['% Total'] = (Broad_Ancestral_NoNR['Total'] /
                                   Broad_Ancestral_NoNR['Total'].sum()) * 100
Broad_Ancestral_NoNR.to_csv(os.path.abspath(
                            os.path.join('__file__', '../..', 'tables',
                                         'Broad_Ancestral_NoNR.csv')))
Broad_Ancestral_NoNR
Out[15]:
N (Initial) N (Replication) Total % Discovery % Replication % Total
Broader
African 366395 144837 511232 0.325277 0.288838 0.314052
African Am./Caribbean 2238062 1015713 3253775 1.986899 2.025562 1.998809
Asian 11109769 9038288 20148057 9.862991 18.024392 12.377041
European 96867242 38423715 135290957 85.996457 76.625584 83.109839
Hispanic/Latin American 1500873 665219 2166092 1.332440 1.326597 1.330640
Other/Mixed 558627 856988 1415615 0.495936 1.709028 0.869619

What do the broad ancestral fields look like before we visualise the broader categories? Here we drop all rows with multiple (comma-separated) entries:

In [16]:
GroupAnc = Cat_Ancestry[(Cat_Ancestry['BROAD ANCESTRAL'] != 'Other') &
                        (Cat_Ancestry['BROAD ANCESTRAL'].str.count(',') == 0)].\
    groupby(['BROAD ANCESTRAL'])['N'].sum().to_frame()
GroupAnc['Individuals (%)'] = (GroupAnc['N'] /
                               GroupAnc['N'].sum()) * 100
GroupAnc.sort_values(by='Individuals (%)',ascending=False)[0:10]
Out[16]:
N Individuals (%)
BROAD ANCESTRAL
European 135290957 78.794470
East Asian 16118883 9.387759
NR 10500447 6.115539
African American or Afro-Caribbean 3253775 1.895023
Hispanic or Latin American 2166092 1.261548
Asian unspecified 2047676 1.192582
South Asian 1688748 0.983540
African unspecified 341925 0.199140
South East Asian 121225 0.070602
Sub-Saharan African 108866 0.063404

Let's now make some bubble plots to visualise the raw broad ancestry field and the fields we extracted from the free-text strings:

In [17]:
countriesdict = {'African': '#4daf4a', 'African Am./Caribbean': '#984ea3',
                 'Asian': '#e41a1c', 'European': '#377eb8',
                 'Hispanic/Latin American': '#ff7f00', 'Other/Mixed': '#ffff33'}
output_path = os.path.abspath(os.path.join('__file__', "../..", "figures", "svg",
                                             "Figure_2.svg"))
plot_bubbles(output_path, Cat_Ancestry, Broad_Ancestral_NoNR, countriesdict)
print('The biggest single sample description is ' + str(Cat_Ancestry['N'].max()) + '.')
The biggest single sample description is 586626.

We can also analyze how these aggregates change over time, which was a key feature of Popejoy and Fullerton (2016). We can provide a much more granular argument (merely remove '_NoNR' from the cell below to undertake an equivalent analysis including rows which contain some in-part 'NR' ancestries):

In [18]:
index = [x for x in range(2007, 2019)]
columns = ['European', 'Asian', 'African', 'Hispanic/Latin American',
           'Other/Mixed', 'African Am./Caribbean']
Cat_Ancestry_NoNR = Cat_Ancestry[
    Cat_Ancestry['Broader'] != 'In Part Not Recorded']
Broad_Ancestral_Time_NoNR_PC = pd.DataFrame(index=index, columns=columns)
for year in range(2007, 2019):
    for broaderancestry in Broad_Ancestral_NoNR.index.tolist():
        Broad_Ancestral_Time_NoNR_PC[broaderancestry.strip()][year] =\
            (Cat_Ancestry_NoNR[(
                Cat_Ancestry_NoNR['DATE'].str.contains(str(year))) &
                (Cat_Ancestry_NoNR['Broader'] ==
                 broaderancestry)]['N'].sum() /
             Cat_Ancestry_NoNR[
             (Cat_Ancestry_NoNR['DATE'].str.contains(
                 str(year)))]['N'].sum()) * 100
Broad_Ancestral_Time_NoNR_PC.to_csv(os.path.abspath(
                            os.path.join('__file__', '../..', 'tables',
                                         'Broad_Ancestral_Time_NoNR_PC.csv')))
Broad_Ancestral_Time_NoNR_PC.style
Out[18]:
European Asian African Hispanic/Latin American Other/Mixed African Am./Caribbean
2007 95.4688 2.13984 0.00542792 0.715245 1.18391 0.486807
2008 95.2884 2.94597 0 0.00180063 1.21894 0.544872
2009 88.1708 7.10488 0.258137 0.224117 3.36046 0.881605
2010 86.8535 9.89447 0.266794 0.05829 2.43575 0.491204
2011 78.2316 15.8233 0.15679 0.397565 1.70889 3.68185
2012 72.3664 19.5695 0.31463 0.729346 2.88951 4.13058
2013 82.1979 11.6865 0.394273 0.787581 0.617715 4.31598
2014 76.6056 18.6152 0.250182 1.15204 0.97794 2.39904
2015 88.0646 9.11528 0.280238 0.738889 0.510574 1.29045
2016 90.785 4.65017 0.167666 1.46566 1.09729 1.83424
2017 88.1308 6.24258 0.569139 2.29557 0.662464 2.09946
2018 72.1842 24.5397 0.272253 1.34697 0.0988922 1.55803

We can focus on the initial discovery stage to calculate the number of individuals required per ancestry to uncover one hit. Note, however, that this requires some distributional assumptions (i.e. that the same diseases are being studied across ancestries) and assumptions about overlapping contributions to cohorts.

In [19]:
Cat_Ancestry_Initial = Cat_Ancestry[Cat_Ancestry['STAGE'] == 'initial']

Cat_Ancestry_NoDups = Cat_Ancestry_Initial.drop_duplicates(
    subset=['STUDY ACCESSION'],
    keep=False)[['PUBMEDID', 'STUDY ACCESSION',
                 'Broader','N']]
Cat_Ancestry_NoDups_merge = pd.merge(Cat_Ancestry_NoDups,
                                     Cat_Studies[['ASSOCIATION COUNT',
                                                  'STUDY ACCESSION',
                                                  'MAPPED_TRAIT']],
                                     how='left', on='STUDY ACCESSION')

listtoiterate = ['European', 'Other/Mixed', 'Asian','African',
                 'African Am./Caribbean', 'Hispanic/Latin American']
for ancestry in listtoiterate:
    temp = Cat_Ancestry_NoDups_merge[Cat_Ancestry_NoDups_merge[
        'Broader'].str.strip() == ancestry]
    print('The number of ' + ancestry + 's required to find one hit: ' +
          str(round(1 /
                    (temp['ASSOCIATION COUNT'].sum() / temp['N'].sum()), 1)))
The number of Europeans required to find one hit: 1488.6
The number of Other/Mixeds required to find one hit: 482.5
The number of Asians required to find one hit: 1355.6
The number of Africans required to find one hit: 779.0
The number of African Am./Caribbeans required to find one hit: 436.8
The number of Hispanic/Latin Americans required to find one hit: 393.4
In [20]:
Cat_Ancestry_Initial['Broader'].unique()
Out[20]:
array(['European', 'Other/Mixed', 'Asian', 'In Part Not Recorded',
       'African Am./Caribbean', 'Hispanic/Latin American', 'African'],
      dtype=object)

Choropleth map of country of recruitment

This is a choropleth map of country of recruitment; note that this field is not quite fully populated in the Catalogue. Basemap code is loosely based on this. As we want to study the number of people recruited from each country, we can only utilize values in the 'Country of Recruitment' field when exactly one country is mentioned. For example: if N=100,000 and Country of Recruitment = {U.K., U.S.}, there is no way for us to know from the Catalogue what the breakdown between the two countries is (especially as the free-text field may just contain 'European' ancestry). Our only option in this scenario is to drop such observations. We are also faced with multiple entries equal to 'NR' ('not recorded'), which reduces the number of studies going into the map by a further 21%. This has no relationship to whether ancestry is coded, and the make_clean_CoR function summarizes these dropped rows below.
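The single-country filtering performed by make_clean_CoR can be sketched as follows. This is only an illustrative, assumed reimplementation on toy data (the column names and figures here are hypothetical); the real logic lives in the Support package:

```python
import pandas as pd

def clean_single_country(df):
    """Keep rows whose recruitment field names exactly one country
    and is not 'NR' (not recorded), reporting what survives."""
    col = df['COUNTRY OF RECRUITMENT'].str.strip()
    kept = df[(col != 'NR') & (~col.str.contains(','))].copy()
    print('{:.2f}% of the rows remain.'.format(100 * len(kept) / len(df)))
    print('{:.2f}% of the N remains.'.format(
        100 * kept['N'].sum() / df['N'].sum()))
    return kept

# Toy data: the multi-country and 'NR' rows are dropped.
toy = pd.DataFrame({
    'COUNTRY OF RECRUITMENT': ['U.K.', 'U.K., U.S.', 'NR', 'Japan'],
    'N': [100000, 50000, 25000, 25000]})
clean = clean_single_country(toy)
```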

In [21]:
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
Clean_CoR = make_clean_CoR(Cat_Ancestry)
Cleaned_CoR_N = pd.DataFrame(Clean_CoR.groupby(['Cleaned Country'])['N'].sum())
Cleaned_CoR_N['Rank_N'] = Cleaned_CoR_N.rank(
    ascending=False, method='dense').astype(int)
Cleaned_CoR_count = pd.DataFrame(
    Clean_CoR.groupby(['Cleaned Country'])['N'].count())
Cleaned_CoR_count['Rank_count'] = Cleaned_CoR_count.rank(
    ascending=False, method='dense').astype(int)
Cleaned_CoR_count = Cleaned_CoR_count.rename(columns={'N': 'Count'})
Cleaned_CoR = pd.merge(Cleaned_CoR_count, Cleaned_CoR_N,
                       left_index=True, right_index=True)
Cleaned_CoR.index.name = None
Cleaned_CoR['Count (%)'] = round((Cleaned_CoR['Count'] /
                                   Cleaned_CoR['Count'].sum()) * 100,2)
Cleaned_CoR['N (%)'] = round((Cleaned_CoR['N'] /
                               Cleaned_CoR['N'].sum()) * 100,2)
countrylookup = pd.read_csv(os.path.abspath(
                            os.path.join('__file__',
                                         '../..',
                                         'data',
                                         'ShapeFiles',
                                         'Country_Lookup.csv')),
                            index_col='Country')
country_merged = pd.merge(countrylookup,
                          Cleaned_CoR,
                          left_index=True,
                          right_index=True)
country_merged['Per Rec'] = round(
    country_merged['N'] / pd.to_numeric(country_merged['2017population']), 2)
country_merged.sort_values('Rank_N', ascending=True)[
    ['Continent', 'Count', 'N', 'Count (%)', 'N (%)', 'Per Rec']].to_csv(
    os.path.abspath(os.path.join('__file__',
                                 '../..',
                                 'tables',
                                 'CountryOfRecruitment.csv')))
Cleaning for single country of recruitment:
54.68% of the rows remain.
30.81% of the N remains.

Let's visualise this table; the 'Per Rec' column is the population-adjusted metric:

In [22]:
country_merged.sort_values('Rank_N', ascending=True)[
    ['Continent', 'Count', 'N', 'Count (%)', 'N (%)', 'Per Rec']].head(10)
Out[22]:
Continent Count N Count (%) N (%) Per Rec
United Kingdom Europe 646 21952515 10.42 40.59 0.33
United States North America 2547 10620180 41.10 19.64 0.03
Japan Asia 464 7458538 7.49 13.79 0.06
Iceland Europe 71 6409109 1.15 11.85 19.13
China Asia 499 2059415 8.05 3.81 0.00
Finland Europe 218 1193333 3.52 2.21 0.22
Korea, South Asia 254 850489 4.10 1.57 0.02
Netherlands Europe 175 663477 2.82 1.23 0.04
Germany Europe 173 424942 2.79 0.79 0.01
Australia Oceania 110 320458 1.78 0.59 0.01
In [23]:
output_path = os.path.abspath(os.path.join('__file__',
                                           '../..',
                                           'figures',
                                           'svg',
                                           'Figure_3.svg'))
choropleth_map(country_merged, 'N', 'OrRd', output_path)

Let's instead set the colourmap by the 'Per Rec' field:

In [24]:
output_path = os.path.abspath(os.path.join('__file__',
                                           '../..',
                                           'figures',
                                           'svg',
                                           'Figure_3_Blues.svg'))
choropleth_map(country_merged, 'Per Rec', 'Blues', output_path)

Let's now aggregate by continent, using a country-to-continent lookup built from data within the shapefile itself (based on CIAWF).

In [25]:
country_merged_sumcount = country_merged[[
    'Continent', 'Count']].groupby(['Continent']).sum()
country_merged_sumN = country_merged[[
    'Continent', 'N']].groupby(['Continent']).sum()
country_merged_sums = pd.merge(
    country_merged_sumN, country_merged_sumcount,
    left_index=True, right_index=True)
country_merged_sums['N (%)'] = (
    country_merged_sums['N'] / country_merged_sums['N'].sum()) * 100
country_merged_sums['Count (%)'] = (
    country_merged_sums['Count'] / country_merged_sums['Count'].sum()) * 100
country_merged_sums.to_csv(os.path.abspath(
                           os.path.join('__file__',
                                        '../..',
                                        'tables',
                                        'ContinentOfRecruitment.csv')))
country_merged_sums
Out[25]:
N Count N (%) Count (%)
Continent
Africa 65089 63 0.120359 1.016621
Asia 11011767 1539 20.362407 24.834597
Europe 31904248 1804 58.995736 29.110860
North America 10739057 2642 19.858126 42.633532
Oceania 332398 119 0.614654 1.920284
Seven seas (open ocean) 1389 1 0.002568 0.016137
South America 24957 29 0.046149 0.467968

Who funds GWAS and what do they fund?

A simple descriptive tabulation

Let's do a super simple tabulation of the Top 20 GWAS funders (measured by 'GWAS contributions', i.e. a funder funding an author's time on a paper).

In [26]:
FunderInfo, AbstractInfo, AuthorMaster = load_pubmed_data()
AllFunders = FunderInfo.groupby(by='Agency').count()
AllFunders.index.name = None
AllFunders = AllFunders.reset_index()
AllFunders = AllFunders.rename(columns={
                               'index': 'Agency',
                               'PUBMEDID': 'Grant Contributions',
                               'GrantCountry': 'Country'})
AllFunders_withcountries = pd.merge(AllFunders[['Agency',
                                                'Grant Contributions']],
                                    FunderInfo[['Agency',
                                                'GrantCountry']].drop_duplicates(
                                        'Agency'),
                                    on='Agency', how='left')
AllFunders_withcountries = AllFunders_withcountries.set_index('Agency')
AllFunders_withcountries.index.name = None
AllFunders_withcountries['% of Total'] = round((
    AllFunders_withcountries['Grant Contributions'] /
    AllFunders_withcountries['Grant Contributions'].sum()) * 100, 2)
AllFunders_withcountries.sort_values(
    'Grant Contributions', ascending=False)[0:20]
Out[26]:
Grant Contributions GrantCountry % of Total
NHLBI NIH HHS 12678 United States 25.89
NCI NIH HHS 5163 United States 10.55
NIA NIH HHS 4111 United States 8.40
MRC 3545 United Kingdom 7.24
NIDDK NIH HHS 2702 United States 5.52
NIMH NIH HHS 2669 United States 5.45
Wellcome Trust 1828 United Kingdom 3.73
NHGRI NIH HHS 1799 United States 3.67
NCRR NIH HHS 1568 United States 3.20
NIAID NIH HHS 1190 United States 2.43
PHS HHS 1141 United States 2.33
NIAMS NIH HHS 1060 United States 2.17
NIAAA NIH HHS 904 United States 1.85
WHI NIH HHS 775 United States 1.58
NINDS NIH HHS 769 United States 1.57
NIDA NIH HHS 699 United States 1.43
NCATS NIH HHS 697 United States 1.42
NIGMS NIH HHS 607 United States 1.24
Cancer Research UK 604 United Kingdom 1.23
Intramural NIH HHS 587 United States 1.20

Let's print out some simple descriptives from this data:

In [27]:
print('There are ' + str(len(FunderInfo['Agency'].drop_duplicates())) +
      ' unique funders returned from PubMed Central.')
print('There are ' + str(len(FunderInfo['GrantID'].drop_duplicates())) +
      ' unique grants returned from PubMed Central.')
print('There are ' + str(len(FunderInfo['GrantCountry'].drop_duplicates())) +
      ' unique grant countries returned from PubMed Central.')
print('Each study has an average of ' +
      str(round(len(FunderInfo) / len(id_list), 2)) + ' grants funding it.')
grantgroupby = FunderInfo.groupby(['Agency', 'GrantID']).size().groupby(
    level=1).max().sort_values(ascending=False).reset_index()
print('The most frequently acknowledged grant is GrantID ' +
      grantgroupby['GrantID'][0] + '.\nThis grant is acknowledged ' +
      str(grantgroupby[0][0]) + ' times.')
There are 133 unique funders returned from PubMed Central.
There are 12743 unique grants returned from PubMed Central.
There are 8 unique grant countries returned from PubMed Central.
Each study has an average of 13.62 grants funding it.
The most frequently acknowledged grant is GrantID P30 DK063491.
This grant is acknowledged 207 times.

International distribution of funders

From which countries do these grants come from?

In [28]:
TopCountryFunders = FunderInfo.groupby(by='GrantCountry').count()
TopCountryFunders = TopCountryFunders.rename(
    columns={'PUBMEDID': 'Number Of Studies'})
TopCountryFunders = TopCountryFunders.sort_values(
    'Number Of Studies', ascending=False)[['Number Of Studies']]
TopCountryFunders = TopCountryFunders.reset_index().rename(
    columns={'GrantCountry': 'Country of Funder'})
TopCountryFunders['Percent Funded (%)'] = (
    TopCountryFunders['Number Of Studies'] /
    TopCountryFunders['Number Of Studies'].sum()) * 100
TopCountryFunders = TopCountryFunders.set_index('Country of Funder')
TopCountryFunders.index.name = None
TopCountryFunders.head(10)
Out[28]:
Number Of Studies Percent Funded (%)
United States 41568 85.077468
United Kingdom 7035 14.398575
Canada 177 0.362267
International 66 0.135083
Austria 7 0.014327
Italy 5 0.010234
Switzerland 1 0.002047

What about the commas?

Let's first quantify the effect of splitting on the commas in the EFO field. We can make identical heatmaps by just replacing EFO_Parent_Mapped with EFO_Parent_Mapped_NoCommas. Splitting inherently assumes that a GWAS which mentions two EFO terms contributes equally to each of them: as much as a GWAS with a single EFO term (i.e. no comma to split on) contributes to that one specific term.
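A minimal sketch of the comma splitting on toy data; the MAPPED_TRAIT column name and the equal weighting of split terms are assumptions here, since the actual logic lives inside EFO_parent_mapper:

```python
import pandas as pd

studies = pd.DataFrame({'PUBMEDID': [1, 2],
                        'MAPPED_TRAIT': ['body mass index',
                                         'asthma, eczema']})

# One row per EFO term: a study listing two terms now contributes one
# (equally weighted) row to each, just as a single-term study does.
exploded = studies.assign(
    MAPPED_TRAIT=studies['MAPPED_TRAIT'].str.split(', ')
).explode('MAPPED_TRAIT')
print(len(exploded))  # 3
```

Note that DataFrame.explode requires pandas >= 0.25.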

In [29]:
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
EFO_Parent_Mapped, EFO_Parent_Mapped_NoCommas = EFO_parent_mapper(
    Cat_Studies, Cat_Ancestry_groupedbyN)
print('There are ' + str(len(EFO_Parent_Mapped)) +
      ' rows of EFO terms after we split for commas.')
print('This indicates ' + str(len(EFO_Parent_Mapped) -
                              len(EFO_Parent_Mapped_NoCommas)) +
      ' additional terms were mapped compared to when we drop '
      'comma-separated entries.')
print('This indicates ' +
      str(len(EFO_Parent_Mapped['EFO term'].drop_duplicates())) +
      ' unique EFO terms to map to Parents.')
print('This is in comparison to ' +
      str(len(EFO_Parent_Mapped_NoCommas['EFO term'].drop_duplicates())) +
      ' unique EFO terms in Cat_Studies.')
There are 7954 rows of EFO terms after we split for commas.
This indicates 3632 additional terms were mapped compared to when we drop comma-separated entries.
This indicates 1955 unique EFO terms to map to Parents.
This is in comparison to 1379 unique EFO terms in Cat_Studies.

What do they study?

In [30]:
mapped_trait_summary(EFO_Parent_Mapped, 'EFO term')
mapped_trait_summary(EFO_Parent_Mapped_NoCommas,'EFO term')
                             Sample   Studies  Associations
EFO term                                                   
Type II Diabetes Mellitus  1.702527  1.248266      0.677679
Body Mass Index            4.019300  1.210440      2.827140
Schizophrenia              0.901475  1.185223      1.162481
Breast Carcinoma           0.960132  0.970874      0.740814
Unipolar Depression        2.499326  0.958265      0.439044
Bipolar Disorder           0.359499  0.958265      0.326676
Alzheimers Disease         0.660883  0.933048      0.464529
HD LC measurement          0.787287  0.882613      1.088342
Colorectal Cancer          0.640769  0.743916      0.275126
Asthma                     0.323274  0.743916      0.318567
Triglyceride Meas.         0.752732  0.743916      1.036212
Prostate Carcinoma         0.687592  0.706090      0.384598
LD LC measurement          0.682481  0.693481      0.948751
Rheumatoid Arthritis       0.612218  0.580003      0.323780
Bone Density               0.285821  0.567394      0.889092
Coronary Heart Disease     0.495850  0.567394      0.155808
Systolic Blood Pressure    2.900267  0.542176      1.064594
                             Sample   Studies  Associations
EFO term                                                   
Type II Diabetes Mellitus  2.730676  1.488718      1.247117
Alzheimers Disease         0.613739  1.232845      0.424419
Body Mass Index            3.405084  1.116539      2.794095
Breast Carcinoma           1.291137  1.070016      1.783792
HD LC measurement          1.084956  1.000233      0.944180
Prostate Carcinoma         0.985917  0.976971      0.764263
Schizophrenia              0.740789  0.930449      1.743811
Colorectal Cancer          0.799914  0.930449      0.530524
Bone Density               0.446477  0.883926      0.864217
LD LC measurement          0.950119  0.837404      0.784253
Asthma                     0.477428  0.744359      0.393664
Bipolar Disorder           0.333635  0.721098      0.290635
Triglyceride Meas.         1.007912  0.721098      0.644318
Unipolar Depression        1.927662  0.697837      0.684300
Rheumatoid Arthritis       0.445863  0.651314      0.744272
Body Height                1.263059  0.651314      1.262494
Multiple Sclerosis         0.220214  0.604792      0.552053
In [31]:
mapped_trait_summary(EFO_Parent_Mapped, 'Parent term')
mapped_trait_summary(EFO_Parent_Mapped_NoCommas, 'Parent term')
                              Sample    Studies  Associations
Parent term                                                  
Other Meas.                32.183505  29.933174     40.221723
Neurological Disorder       9.101095  10.629177      4.949956
Other Disease               4.716185   7.678729      3.750405
Cancer                      6.174897   6.657420      3.109216
Response To Drug            1.952623   6.392636      3.901001
Lipid/Lipoprotein Meas.     3.339298   4.917413      5.939252
Other Trait                 3.699485   4.841760      2.237501
Digestive System Disorder   3.301248   4.337410      2.321486
Cardiovascular Disease      7.048362   4.148279      1.804828
Immune System Disorder      2.770669   3.656538      2.261828
Biological Process          4.617704   3.606103      5.121403
Cardiovascular Meas.        3.672297   3.404363      3.307307
Body Meas.                  9.066682   3.190014      9.960729
Hematological Meas.         4.538392   2.761316      5.211760
Metabolic Disorder          2.486578   2.483924      0.951647
Inflammatory Meas.          0.875550   0.907830      3.532042
Liver Enzyme Meas.          0.455430   0.453915      1.417914
                              Sample    Studies  Associations
Parent term                                                  
Other Meas.                28.711733  29.192836     33.132400
Neurological Disorder       8.863972  11.421261      7.322774
Other Disease               5.388455   8.374040      5.960326
Cancer                      7.567030   7.885555      5.675842
Lipid/Lipoprotein Meas.     4.333443   4.791812      4.271875
Cardiovascular Disease      8.965029   4.745290      3.358450
Other Trait                 3.675101   4.628983      2.937106
Digestive System Disorder   2.874193   4.373110      3.967400
Immune System Disorder      2.643363   4.373110      3.924343
Cardiovascular Meas.        4.444092   3.814841      2.594187
Hematological Meas.         6.716879   3.628751     11.720744
Biological Process          3.219036   3.349616      5.331386
Body Meas.                  7.909619   3.210049      5.734276
Metabolic Disorder          3.150736   2.465690      1.709980
Response To Drug            0.122694   2.442428      1.085653
Inflammatory Meas.          0.742754   0.860665      0.907274
Liver Enzyme Meas.          0.671871   0.441963      0.365985

Who's funding what?

In [32]:
FunderInfo, AbstractInfo, AuthorMaster = load_pubmed_data()
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
EFO_Parent_Mapped, EFO_Parent_Mapped_NoCommas = EFO_parent_mapper(
    Cat_Studies, Cat_Ancestry_groupedbyN)

EFO_Parent_Mapped['Parent term'] = \
    EFO_Parent_Mapped['Parent term'].str.replace('Disorder', '')

FunderInfo_Parent = pd.merge(FunderInfo, EFO_Parent_Mapped,
                             left_on='PUBMEDID', right_on='PUBMEDID',
                             how='left')

funder_ancestry, funder_parent = make_funders(FunderInfo_Parent, FunderInfo, Cat_Ancestry)
funder_parent.to_csv(os.path.abspath(os.path.join('__file__',
                                           '../..',
                                           'tables',
                                           'Funders_and_Ancestry.csv')))
funder_ancestry.to_csv(os.path.abspath(os.path.join('__file__',
                                           '../..',
                                           'tables',
                                           'Funders_and_BroadEFO.csv')))
output_path = os.path.abspath(os.path.join('__file__',
                                           '../..',
                                           'figures',
                                           'svg',
                                           'Figure_4.svg'))
plot_heatmap(funder_ancestry, funder_parent, output_path)

Let's do the same thing, but without splitting on the EFO commas:

In [33]:
FunderInfo, AbstractInfo, AuthorMaster = load_pubmed_data()
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
EFO_Parent_Mapped, EFO_Parent_Mapped_NoCommas = EFO_parent_mapper(
    Cat_Studies, Cat_Ancestry_groupedbyN)

EFO_Parent_Mapped_NoCommas['Parent term'] = \
    EFO_Parent_Mapped_NoCommas['Parent term'].str.replace('Disorder', '')

FunderInfo_Parent = pd.merge(FunderInfo, EFO_Parent_Mapped_NoCommas,
                             left_on='PUBMEDID', right_on='PUBMEDID',
                             how='left')

funder_ancestry, funder_parent = make_funders(FunderInfo_Parent, FunderInfo, Cat_Ancestry)
output_path = os.path.abspath(os.path.join('__file__', '../..', 'figures',
                                         'svg', 'Figure_4_NoCommas.svg'))
plot_heatmap(funder_ancestry, funder_parent, output_path)

Who are the Authors?

Who are the most cited authors? What is their 'GWAS H-Index', calculated based only on their papers in the GWAS Catalogue? This assumes unique forename + surname combinations, and that the same author is recorded consistently across studies. First, a couple of snippets:

In [34]:
FunderInfo, AbstractInfo, AuthorMaster = load_pubmed_data()
print('There are a total of ' + str(len(AuthorMaster)) +
      ' "authorship contributions".')
print('These contributions are made by ' +
      str(len(AuthorMaster['AUTHORNAME'].unique())) + ' unique authors.')
print('There are a total of ' +
      str(len(AuthorMaster) /
          len(AuthorMaster['PUBMEDID'].drop_duplicates())) +
      ' "authorship contributions per paper".')
print('The study with the most number of authors has ' +
      str(AuthorMaster.groupby(['PUBMEDID']).size().max()) +
      ' authors (PubMed ID: ' +
      str(AuthorMaster.groupby(['PUBMEDID']).size().idxmax()) + ')')
There are a total of 120994 "authorship contributions".
These contributions are made by 39707 unique authors.
There are a total of 33.79720670391062 "authorship contributions per paper".
The study with the most number of authors has 559 authors (PubMed ID: 22479202)

Calculate 'GWAS-H' indexes

Let's then calculate our GWAS-H indexes using citation data and paper counts for unique authors.
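Before looping over all authors, the H-index definition itself (the largest h such that an author has h papers each cited at least h times) can be sketched in isolation:

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    h = 0
    for rank, cites in enumerate(sorted(citations, reverse=True), start=1):
        if cites >= rank:
            h = rank  # this paper still supports an index of `rank`
        else:
            break
    return h

print(h_index([10, 8, 5, 4, 3]))  # 4
print(h_index([2, 1, 1]))         # 1
```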

In [35]:
CitationCounts = pd.read_csv(os.path.abspath(
    os.path.join('__file__', "../..",
                 "data", "PUBMED", 'Pubmed_Cites.csv')))
AuthorMaster_withcites = pd.merge(
    AuthorMaster, CitationCounts, on=['PUBMEDID'], how='left')
allauthors_formerge = pd.DataFrame(
    AuthorMaster_withcites[['AUTHORNAME', 'citedByCount']])
allauthors_papercount = pd.DataFrame(
    allauthors_formerge['AUTHORNAME'].value_counts())
allauthors_citecount = pd.DataFrame(
    allauthors_formerge.groupby(by='AUTHORNAME')['citedByCount'].sum())
allauthors_merged = pd.merge(
    allauthors_papercount, allauthors_citecount, left_index=True,
    right_index=True)
allauthors_merged.columns = ['Papers', 'citedByCount']
allauthors_merged = allauthors_merged.sort_values(
    by='citedByCount', ascending=False)
allauthors_merged['GWAS-H'] = 0
counter = 0
for author in allauthors_merged.index:
    counter += 1
    temp = AuthorMaster_withcites[AuthorMaster_withcites['AUTHORNAME'] ==
                                  author].sort_values(by='citedByCount',
                                                      ascending=False).dropna()
    temp = temp.reset_index(drop=True)
    for pubnumber in range(0, len(temp)):
        if pubnumber + 1 > temp['citedByCount'][pubnumber]:
            break
        else:
            allauthors_merged.loc[author, ('GWAS-H')] = int(pubnumber+1)
    sys.stdout.write('\r' + 'Calculating GWAS H-indices: finished ' +
                     str(counter) + ' of ' +
                     str(len(allauthors_merged.index)) + ' authors...')
allauthors_citecount.reset_index(inplace=True)
allauthors_papercount.reset_index(inplace=True)
allauthors_papercount.rename(
    columns={'AUTHORNAME': 'PAPERCOUNT', 'index': 'AUTHORNAME', },
    inplace=True)
Calculating GWAS H-indices: finished 39707 of 39707 authors...

Compare Author Metrics Between Genders

Calculate author centralities

Let's next calculate some measures of authorship centrality with NetworkX. Note: this can take some time to run. To make it faster, raise the citedByCount threshold to something like 100 to keep only authors with at least a hundred citations (or tighten the PAPERCOUNT > 1 filter).

In [36]:
AuthorMaster_sumcites = pd.merge(AuthorMaster, allauthors_citecount,
                                 left_on='AUTHORNAME', right_on='AUTHORNAME',
                                 how='left')
AuthorMaster_sumcitespapercount = pd.merge(AuthorMaster_sumcites,
                                           allauthors_papercount,
                                           left_on='AUTHORNAME',
                                           right_on='AUTHORNAME', how='left')
AuthorMaster_sumcitespapercount_filter_cites = AuthorMaster_sumcitespapercount[
    AuthorMaster_sumcitespapercount['citedByCount'] > 10]
AuthorMaster_sumcitespapercount_filtered =\
    AuthorMaster_sumcitespapercount_filter_cites[
        AuthorMaster_sumcitespapercount_filter_cites['PAPERCOUNT'] > 1]

G = nx.Graph()
counter = 0
alledges = []
for paper in AuthorMaster_sumcitespapercount_filtered['PUBMEDID'].unique():
    counter += 1
    temp = AuthorMaster_sumcitespapercount_filtered[
        AuthorMaster_sumcitespapercount_filtered['PUBMEDID'] == paper]
    if len(temp) > 1:
        templist = list(itertools.combinations(temp.AUTHORNAME, 2))
        for edge in templist:
            alledges.append(edge)
G.add_edges_from(list(set(alledges)))

print('This gives us a network with ' + str(len(G)) +
      ' nodes.\nEach node represents a unique author with more than 1 paper and more than 10 cites')

betcent = pd.Series(nx.betweenness_centrality(G), name='Betweenness')
allauthors_merged = allauthors_merged.merge(
    betcent.to_frame(), left_index=True, right_index=True)

degcent = pd.Series(nx.degree_centrality(G), name='Degree')
allauthors_merged = allauthors_merged.merge(
    degcent.to_frame(), left_index=True, right_index=True)
This gives us a network with 15031 nodes.
Each node represents a unique author with more than 1 paper and more than 10 cites

We can then merge all of this with data collected manually from web searches, covering each author's country of employment, current employer, etc. We then rank in three different ways to analyze the overlap between the three metrics.

In [37]:
authorsupplemental = pd.read_csv(os.path.abspath(
    os.path.join('__file__', '../..',
                 'data', 'Support', 'Author_Supplmentary.csv')),
     encoding='latin-1', index_col=0)
allauthors_merged_withsupp = pd.merge(allauthors_merged,
                                      authorsupplemental,
                                      left_index=True, right_index=True,
                                      how='left')
allauthors_merged_withsupp['Betweenness'] = round(allauthors_merged_withsupp['Betweenness'],3)
allauthors_merged_withsupp['Degree'] = round(allauthors_merged_withsupp['Degree'],3)
allauthors_merged_withsupp['CitePercentile'] = round(allauthors_merged_withsupp.citedByCount.rank(pct=True),4)
allauthors_merged_withsupp[['Papers','citedByCount',
                           'GWAS-H','Betweenness',
                           'Degree','Country',
                           'Institution',
                           'CitePercentile']].sort_values(by='GWAS-H',
                                                         ascending=False).to_csv(os.path.abspath(
                                                                                 os.path.join('__file__',
                                                                                              '../..',
                                                                                              'tables',
                                                                                              'Authors.csv')))
allauthors_merged_withsupp.sort_values(by='GWAS-H', ascending=False).head(10)
Out[37]:
Papers citedByCount GWAS-H Betweenness Degree Country Institution CitePercentile
Kari Stefansson 174 27537 84 0.019 0.306 Iceland deCode Genetics 1.0000
Andre G Uitterlinden 278 23305 76 0.018 0.367 Netherlands Erasmus MC 0.9998
Unnur Thorsteinsdottir 139 23610 76 0.006 0.240 Iceland deCode Genetics 0.9999
Albert Hofman 265 25502 76 0.013 0.346 U.S. University of Harvard 0.9999
Cornelia M van Duijn 188 20851 71 0.008 0.295 Netherlands UMC Rotterdam 0.9997
Gudmar Thorleifsson 117 20384 70 0.006 0.231 Iceland deCode Genetics 0.9995
Christian Gieger 166 22538 69 0.011 0.273 Germany Helmholtz-Muenchen 0.9997
Panos Deloukas 108 20307 68 0.009 0.232 U.K. Queen Mary UoL 0.9995
H-Erich Wichmann 111 20251 68 0.007 0.220 Germany Helmholtz-Muenchen 0.9994
Fernando Rivadeneira 197 17952 65 0.009 0.282 Netherlands UMC Rotterdam 0.9992
In [38]:
allauthors_merged_withsupp.sort_values(by='Degree',
                                       ascending=False).head(10)
Out[38]:
Papers citedByCount GWAS-H Betweenness Degree Country Institution CitePercentile
Andre G Uitterlinden 278 23305 76 0.018 0.367 Netherlands Erasmus MC 0.9998
Albert Hofman 265 25502 76 0.013 0.346 U.S. University of Harvard 0.9999
Kari Stefansson 174 27537 84 0.019 0.306 Iceland deCode Genetics 1.0000
Cornelia M van Duijn 188 20851 71 0.008 0.295 Netherlands UMC Rotterdam 0.9997
Fernando Rivadeneira 197 17952 65 0.009 0.282 Netherlands UMC Rotterdam 0.9992
Jerome I Rotter 185 17806 59 0.011 0.281 U.S. UCLA 0.9991
Christian Gieger 166 22538 69 0.011 0.273 Germany Helmholtz-Muenchen 0.9997
Bruce M Psaty 167 12947 58 0.005 0.263 U.S. University of Washington 0.9965
Nicholas G Martin 161 13559 58 0.009 0.259 Australia QIMR Berghofer 0.9973
Vilmundur Gudnason 128 15356 59 0.003 0.258 Iceland I.H.A. 0.9983
In [39]:
allauthors_merged_withsupp.sort_values(by='citedByCount',
                                       ascending=False).head(10)
Out[39]:
Papers citedByCount GWAS-H Betweenness Degree Country Institution CitePercentile
Kari Stefansson 174 27537 84 0.019 0.306 Iceland deCode Genetics 1.0000
Albert Hofman 265 25502 76 0.013 0.346 U.S. University of Harvard 0.9999
Unnur Thorsteinsdottir 139 23610 76 0.006 0.240 Iceland deCode Genetics 0.9999
Andre G Uitterlinden 278 23305 76 0.018 0.367 Netherlands Erasmus MC 0.9998
Christian Gieger 166 22538 69 0.011 0.273 Germany Helmholtz-Muenchen 0.9997
Cornelia M van Duijn 188 20851 71 0.008 0.295 Netherlands UMC Rotterdam 0.9997
Mark I McCarthy 98 20664 62 0.003 0.187 U.K. University of Oxford 0.9996
Gudmar Thorleifsson 117 20384 70 0.006 0.231 Iceland deCode Genetics 0.9995
Panos Deloukas 108 20307 68 0.009 0.232 U.K. Queen Mary UoL 0.9995
H-Erich Wichmann 111 20251 68 0.007 0.220 Germany Helmholtz-Muenchen 0.9994

And then make a simple correlation matrix to check how highly related these metrics are:
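For reference, `DataFrame.corr()` computes pairwise Pearson correlations by default, returning a symmetric matrix with a unit diagonal; a small synthetic sketch (the variable names are made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    'metric_a': x,
    'metric_b': x * 2 + rng.normal(scale=0.1, size=200),  # nearly collinear
    'metric_c': rng.normal(size=200),                     # unrelated noise
})

# Pairwise Pearson correlations: a/b correlate strongly, a/c near zero.
corr = df.corr()
print(corr.round(2))
```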

In [40]:
allauthors_merged_withsupp[['citedByCount', 'GWAS-H',
                            'Betweenness', 'Degree', 'Papers']].corr()
Out[40]:
citedByCount GWAS-H Betweenness Degree Papers
citedByCount 1.000000 0.894780 0.502013 0.819143 0.847613
GWAS-H 0.894780 1.000000 0.524466 0.881267 0.943747
Betweenness 0.502013 0.524466 1.000000 0.475675 0.637942
Degree 0.819143 0.881267 0.475675 1.000000 0.851511
Papers 0.847613 0.943747 0.637942 0.851511 1.000000
In [41]:
print('There are a total of ' +
      str(len(allauthors_merged_withsupp)) + ' authors in the table.')
print('The person with the highest GWAS H-Index is: ' +
      allauthors_merged_withsupp['GWAS-H'].idxmax())
print('The person with the highest Degree is: ' +
      allauthors_merged_withsupp['Degree'].idxmax())
print('The person with the highest citedByCount is: ' +
      allauthors_merged_withsupp['citedByCount'].idxmax())
print('The person with the largest number of Papers is: ' +
      allauthors_merged_withsupp['Papers'].idxmax())
There are a total of 15032 authors in the table.
The person with the highest GWAS H-Index is: Kari Stefansson
The person with the highest Degree is: Andre G Uitterlinden
The person with the highest citedByCount is: Kari Stefansson
The person with the largest number of Papers is: Andre G Uitterlinden

Author gender

Let's consider the gender of each author by analyzing their forenames using the gender-guesser library, which is based on gender.c by Jörg Michael. We can directly compare our results to this wonderful paper. One key assumption made here is that all 'mostly_' male and female names are combined into their respective male/female categories.
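A minimal sketch of that collapsing step, using labels in the format that gender-guesser's `Detector.get_gender()` returns ('andy' denotes androgynous); the example labels below are made up rather than taken from our author list:

```python
import pandas as pd

# Example labels in the format returned by gender_guesser's get_gender().
labels = pd.Series(['male', 'mostly_male', 'female',
                    'mostly_female', 'andy', 'unknown'])

# Collapse the 'mostly_' categories into plain male/female.
collapsed = labels.str.replace('mostly_', '')
print(collapsed.value_counts())
```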

In [42]:
gendet = gender.Detector()
FunderInfo, AbstractInfo, AuthorMaster = load_pubmed_data()
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()

AuthorCounts = AuthorMaster.groupby(['PUBMEDID'])['AUTHORNAME'].count(
).to_frame().reset_index().rename(columns={"AUTHORNAME": "Author Count"})
AuthorMaster = pd.merge(AuthorMaster, AuthorCounts, how='left', on='PUBMEDID')
AuthorMaster['CleanForename'] = AuthorMaster['FORENAME'].map(
    lambda x: clean_names(x))
AuthorMaster['CleanGender'] = AuthorMaster['CleanForename'].map(
    lambda x: gendet.get_gender(x))
AuthorMaster['MaleFemale'] = AuthorMaster['CleanGender'].str.replace('mostly_',
                                                                     '')
AuthorMaster['isfemale'] = np.where(
    AuthorMaster['MaleFemale'] == 'female', 1, 0)
AuthorFirst = AuthorMaster[AuthorMaster['Author Count']
                           > 4].drop_duplicates(subset='PUBMEDID', keep='first')
AuthorLast = AuthorMaster[AuthorMaster['Author Count']
                          > 4].drop_duplicates(subset='PUBMEDID', keep='last')
AuthorUnique = AuthorMaster.drop_duplicates(subset='AUTHORNAME')

for gender_ in AuthorMaster['CleanGender'].unique():
    print(str(round((len(AuthorMaster[AuthorMaster['CleanGender'] == gender_]) /
                     len(AuthorMaster['CleanGender'])) * 100, 2)) + '%' +
          ' ' + gender_ + ' authorship contributions in total')
    print(str(round((len(AuthorUnique[AuthorUnique['CleanGender'] == gender_]) /
                     len(AuthorUnique['CleanGender'])) * 100, 2)) + '%' +
          ' ' + gender_ + ' authors contributions in total')
    print(str(round((len(AuthorFirst[AuthorFirst['CleanGender'] == gender_]) /
                     len(AuthorFirst['CleanGender'])) * 100, 2)) + '%' +
          ' ' + gender_ + ' first authors in total')
    print(str(round((len(AuthorLast[AuthorLast['CleanGender'] == gender_]) /
                     len(AuthorLast['CleanGender'])) * 100, 2)) + '%' +
          ' ' + gender_ + ' last authors in total')

print('\nPercent of male author contributions: ' +
      str(round(len(AuthorMaster[AuthorMaster['MaleFemale'] == 'male']) /
                (len(AuthorMaster[(AuthorMaster['MaleFemale'] == 'male') |
                                  (AuthorMaster['MaleFemale'] == 'female')])) *
                100, 2)) + '%')

print('Percent of unique male authors: ' +
      str(round(len(AuthorUnique[AuthorUnique['MaleFemale'] == 'male']) /
                (len(AuthorUnique[(AuthorUnique['MaleFemale'] == 'male') |
                                  (AuthorUnique['MaleFemale'] == 'female')])) *
                100, 2)) + '%')

print('Percent of male first authors: ' +
      str(round(len(AuthorFirst[AuthorFirst['MaleFemale'] == 'male']) /
                (len(AuthorFirst[(AuthorFirst['MaleFemale'] == 'male') |
                                 (AuthorFirst['MaleFemale'] == 'female')])) *
                100, 2)) + '%')

print('Percent of male last authors: ' +
      str(round(len(AuthorLast[AuthorLast['MaleFemale'] == 'male']) /
                (len(AuthorLast[(AuthorLast['MaleFemale'] == 'male') |
                                (AuthorLast['MaleFemale'] == 'female')])) *
                100, 2)) + '%')

AuthorMaster_filtered = AuthorMaster[(AuthorMaster['MaleFemale'].str.contains(
    'male')) & (AuthorMaster['Author Count'] > 4)]
AuthorMaster_filtered_merged_bydisease = pd.merge(
    AuthorMaster_filtered, Cat_Studies[['PUBMEDID', 'DISEASE/TRAIT']],
    how='left', on='PUBMEDID')

meanfemales_bydisease = AuthorMaster_filtered_merged_bydisease.groupby(
    ['DISEASE/TRAIT'])['isfemale'].mean().to_frame()
numberofstudies_bydisease = Cat_Studies.groupby(
    ['DISEASE/TRAIT'])['DISEASE/TRAIT'].count().to_frame()
mergedgender_andcount_bydisease = pd.merge(
    numberofstudies_bydisease, meanfemales_bydisease, left_index=True,
    right_index=True)
mergedgender_andcount_bydisease = mergedgender_andcount_bydisease.sort_values(
    by='DISEASE/TRAIT', ascending=False)[0:10]
holdstring = 'Percent of authorships across 10 most commonly studied diseases:\n'
for index, row in mergedgender_andcount_bydisease.iterrows():
    holdstring = holdstring + index.title() + ' (' + \
        str(round(round(row['isfemale'], 2) * 100, 1)) + '%)\n'
print('\n' + holdstring[:-1])
26.1% female authorship contributions in total
25.19% female authors contributions in total
27.13% female first authors in total
19.5% female last authors in total
21.17% unknown authorship contributions in total
28.88% unknown authors contributions in total
27.99% unknown first authors in total
23.99% unknown last authors in total
1.34% mostly_male authorship contributions in total
1.51% mostly_male authors contributions in total
1.37% mostly_male first authors in total
1.72% mostly_male last authors in total
45.52% male authorship contributions in total
37.8% male authors contributions in total
34.93% male first authors in total
49.26% male last authors in total
4.48% andy authorship contributions in total
5.13% andy authors contributions in total
7.26% andy first authors in total
3.72% andy last authors in total
1.38% mostly_female authorship contributions in total
1.48% mostly_female authors contributions in total
1.32% mostly_female first authors in total
1.83% mostly_female last authors in total

Percent of male author contributions: 63.03%
Percent of unique male authors: 59.58%
Percent of male first authors: 56.07%
Percent of male last authors: 70.5%

Percent of authorships across 10 most commonly studied diseases:
Type 2 Diabetes (33.0%)
Body Mass Index (39.0%)
Breast Cancer (50.0%)
Colorectal Cancer (34.0%)
Schizophrenia (30.0%)
Prostate Cancer (41.0%)
Alzheimer'S Disease (40.0%)
Asthma (37.0%)
Height (37.0%)
Hdl Cholesterol (36.0%)
In [43]:
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
EFO_Parent_Mapped, EFO_Parent_Mapped_NoCommas = EFO_parent_mapper(
    Cat_Studies, Cat_Ancestry_groupedbyN)

ranks, countstudiesperEFO, meanfemale,\
AuthorMaster_EFO, AuthorMaster_Parent = make_meanfemale_andranks(AuthorMaster,
                                                                 EFO_Parent_Mapped)
output_path = os.path.abspath(os.path.join('__file__', "../..", "figures",
                                           'svg', 'Sup_Figure_1_Commas.svg'))
boxswarm_plot(meanfemale, ranks, output_path)

AuthorMaster_Parent.reset_index(inplace=True)
holdstring = 'Get the numbers for Parent Terms:\n'
for index, row in AuthorMaster_Parent.iterrows():
    holdstring = holdstring + row['Parent term'].title() + ' (' + \
        str(round(round(row['isfemale'], 2) * 100, 1)) + '%)\n'
print('\n' + holdstring[:-1])
holdstring = 'Get the numbers for Trait Terms (figure above):\n'
for index, row in AuthorMaster_EFO[
    AuthorMaster_EFO['EFO term'].isin(countstudiesperEFO)].sort_values(
        by='isfemale', ascending=False).iterrows():
    holdstring = holdstring + row['EFO term'].title() + ' (' + \
        str(round(round(row['isfemale'], 2) * 100, 1)) + '%)\n'
print('\n' + holdstring[:-1])
Get the numbers for Parent Terms:
Biological Process (41.0%)
Body Meas. (41.0%)
Cancer (47.0%)
Cardiovascular Disease (33.0%)
Cardiovascular Meas. (33.0%)
Digestive System Disorder (31.0%)
Hematological Meas. (34.0%)
Immune System Disorder (36.0%)
Inflammatory Meas. (33.0%)
Lipid/Lipoprotein Meas. (36.0%)
Liver Enzyme Meas. (35.0%)
Metabolic Disorder (34.0%)
Neurological Disorder (35.0%)
Other Disease (34.0%)
Other Meas. (38.0%)
Other Trait (36.0%)
Response To Drug (34.0%)

Get the numbers for Trait Terms (figure above):
Breast Carcinoma (51.0%)
Prostate Carcinoma (42.0%)
Body Mass Index (41.0%)
Bipolar Disorder (37.0%)
Alzheimers Disease (37.0%)
Colorectal Cancer (37.0%)
Asthma (36.0%)
Triglyceride Meas. (36.0%)
Hd Lc Measurement (35.0%)
Unipolar Depression (35.0%)
Rheumatoid Arthritis (35.0%)
Ld Lc Measurement (35.0%)
Type Ii Diabetes Mellitus (33.0%)
Schizophrenia (31.0%)

Let's do exactly the same as above, but drop the fields which contain commas:

In [44]:
ranks, countstudiesperEFO, meanfemale, AuthorMaster_EFO, AuthorMaster_Parent = make_meanfemale_andranks(AuthorMaster, EFO_Parent_Mapped_NoCommas)
output_path = os.path.abspath(os.path.join('__file__', "../..", "figures",
                                           'svg', 'Sup_Figure_1_NoCommas.svg'))
boxswarm_plot(meanfemale, ranks, output_path)

AuthorMaster_Parent.reset_index(inplace=True)
holdstring = 'Get the numbers for Parent Terms:\n'
for index, row in AuthorMaster_Parent.iterrows():
    holdstring = holdstring + row['Parent term'].title() + ' (' + \
        str(round(round(row['isfemale'], 2) * 100, 1)) + '%)\n'
print('\n' + holdstring[:-1])
holdstring = 'Get the numbers for Trait Terms (figure above):\n'
for index, row in AuthorMaster_EFO[
    AuthorMaster_EFO['EFO term'].isin(countstudiesperEFO)].sort_values(
        by='isfemale', ascending=False).iterrows():
    holdstring = holdstring + row['EFO term'].title() + ' (' + \
        str(round(round(row['isfemale'], 2) * 100, 1)) + '%)\n'
print('\n' + holdstring[:-1])
Get the numbers for Parent Terms:
Biological Process (37.0%)
Body Meas. (40.0%)
Cancer (48.0%)
Cardiovascular Disease (31.0%)
Cardiovascular Meas. (32.0%)
Digestive System Disorder (32.0%)
Hematological Meas. (34.0%)
Immune System Disorder (36.0%)
Inflammatory Meas. (29.0%)
Lipid/Lipoprotein Meas. (36.0%)
Liver Enzyme Meas. (36.0%)
Metabolic Disorder (34.0%)
Neurological Disorder (35.0%)
Other Disease (34.0%)
Other Meas. (37.0%)
Other Trait (36.0%)
Response To Drug (33.0%)

Get the numbers for Trait Terms (figure above):
Breast Carcinoma (50.0%)
Prostate Carcinoma (42.0%)
Body Mass Index (40.0%)
Alzheimers Disease (37.0%)
Triglyceride Meas. (36.0%)
Asthma (36.0%)
Bone Density (36.0%)
Hd Lc Measurement (35.0%)
Unipolar Depression (35.0%)
Ld Lc Measurement (35.0%)
Colorectal Cancer (34.0%)
Bipolar Disorder (34.0%)
Type Ii Diabetes Mellitus (33.0%)
Schizophrenia (30.0%)

Compare Author Metrics Between Genders

In [45]:
allauthors_merged_reset = allauthors_merged.reset_index().rename(columns={'index': 'Name'})
allauthors_merged_reset['CleanForename'] = allauthors_merged_reset['Name'].map(lambda x: x.split(' ')[0]) 
allauthors_merged_reset['CleanGender'] = allauthors_merged_reset['CleanForename'].map(
    lambda x: gendet.get_gender(x))
allauthors_merged_reset['MaleFemale'] = allauthors_merged_reset['CleanGender'].str.replace('mostly_',
                                                                     '')
allauthors_merged_reset['isfemale'] = np.where(
   allauthors_merged_reset['MaleFemale'] == 'female', 1, 0)

print('The average GWAS-H Index for females is: ' +
      str(round(allauthors_merged_reset[allauthors_merged_reset['MaleFemale']=='female']\
                ['GWAS-H'].mean(), 2)))
print('The average GWAS-H Index for males is: ' +
      str(round(allauthors_merged_reset[allauthors_merged_reset['MaleFemale']=='male']\
                ['GWAS-H'].mean(), 2)))
print('The average number of papers published by females is: ' +
      str(round(allauthors_merged_reset[allauthors_merged_reset['MaleFemale']=='female']\
                ['Papers'].mean(), 2)))
print('The average number of papers published by males is: ' +
     str(round(allauthors_merged_reset[allauthors_merged_reset['MaleFemale']=='male']\
               ['Papers'].mean(), 2)))
print('The average number of citations for females is: ' +
      str(round(allauthors_merged_reset[allauthors_merged_reset['MaleFemale']=='female']\
                ['citedByCount'].mean(), 2)))
print('The average number of citations for males is: ' +
      str(round(allauthors_merged_reset[allauthors_merged_reset['MaleFemale']=='male']\
                ['citedByCount'].mean(), 2)))
The average GWAS-H Index for females is: 4.86
The average GWAS-H Index for males is: 5.36
The average number of papers published by females is: 6.11
The average number of papers published by males is: 7.13
The average number of citations for females is: 650.39
The average number of citations for males is: 783.78

Cohorts and Consortia

Consortia/collectives

Let's use the PubMed database returns on collectives to analyze which consortia and study groups are acknowledged:
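The dictionary-merge-and-flag pattern used below (left-merge, then inspect rows with no match) can also be expressed with pandas' merge indicator; here is a sketch with hypothetical collective names and mappings:

```python
import pandas as pd

# Hypothetical raw collectives and a small name-cleaning dictionary.
raw = pd.DataFrame({'COLLECTIVE': ['  WTCCC', 'charge ', 'MysteryGroup']})
mapping = pd.DataFrame(
    {'COLLECTIVE': ['wtccc', 'charge'],
     'Clean_Name': ['Wellcome Trust Case Control Consortium', 'CHARGE']})

# Normalise before merging, then use indicator=True to flag unmapped rows.
raw['COLLECTIVE'] = raw['COLLECTIVE'].str.strip().str.lower()
merged = raw.merge(mapping, how='left', on='COLLECTIVE', indicator=True)
unverified = merged[merged['_merge'] == 'left_only']
print(unverified['COLLECTIVE'].tolist())
```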

In [46]:
pubmed_collectives = pd.read_csv(os.path.abspath(
                                 os.path.join('__file__', '../..',
                                              'data', 'PUBMED',
                                              'Pubmed_CollectiveInfo.csv')),
                                 encoding='latin-1')
collect_dict = pd.read_csv(os.path.abspath(
                           os.path.join(
                               '__file__', '../..', 'data',
                               'Support', 'Collectives',
                               'Collective_Dictionary.csv')),
                           encoding='latin-1')
pubmed_collectives['COLLECTIVE'] = pubmed_collectives['COLLECTIVE'].str.strip()
pubmed_collectives['COLLECTIVE'] = pubmed_collectives['COLLECTIVE'].str.lower()
collect_dict['COLLECTIVE'] = collect_dict['COLLECTIVE'].str.strip(
).str.lower()
merged_collect = pd.merge(
    pubmed_collectives, collect_dict, how='left', on='COLLECTIVE')
if len(merged_collect[merged_collect['Clean_Name'].isnull()]) > 0:
    print('Danger! Correct collectives in Collective_Unverified.csv ' +
          'and add to the dictionary\n')
    unverified_collect = merged_collect[merged_collect['Clean_Name'].isnull(
    )]
    unverified_collect.to_csv(os.path.abspath(
                              os.path.join('__file__', '../..',
                                           'data', 'Support', 'Collectives',
                                           'Collective_Unverified.csv')),
                              encoding='latin-1')
else:
    print('The consortium dictionary is up to date!\n')
consortiumlist = []
for index, row in merged_collect.iterrows():
    if pd.isnull(row['Clean_Name']) is False:
        for consortium in row['Clean_Name'].split(';'):
            consortiumlist.append(consortium)
clean_collectives = pd.DataFrame(consortiumlist, columns=['Consortium'])
groupby_collectives = clean_collectives.groupby(
    ['Consortium'])['Consortium'].count().sort_values(ascending=False)
holdstring = 'The most frequently seen consortia are:\n'
for index, row in groupby_collectives.to_frame()[0:10].iterrows():
    holdstring = holdstring + index + ' (' + str(row['Consortium']) + ')\n'
print(holdstring[:-1])

print('\nWe have seen a total of ' +
      str(len(groupby_collectives.to_frame())) + ' different collectives!')
print('A total of ' + str(len(pubmed_collectives['PUBMEDID'].drop_duplicates(
))) + ' papers are contributed to by at least one collective.')
print('A total of ' + str(groupby_collectives.sum()) +
      ' collective contributions are made.')
The consortium dictionary is up to date!

The most frequently seen consortia are:
Wellcome Trust Case Control Consortium (49)
CHARGE (45)
Wellcome Trust Case Control Consortium 2 (36)
LifeLines Cohort Study (30)
DIAGRAM (29)
Alzheimer's Disease Neuroimaging Initiative (25)
CARDIOGRAM (24)
MAGIC (20)
GIANT Consortium (18)
kConFab (16)

We have seen a total of 675 different collectives!
A total of 833 papers are contributed to by at least one collective.
A total of 1638 collective contributions are made.

Datasets

Let's now wrangle together some of the manually collated data and consider the most frequently observed cohorts, with the aim of analyzing the resampling issue discussed earlier in the literature.
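The split-and-count wrangling in the next cell can also be written with pandas' `Series.explode` (available from pandas 0.25); a sketch with made-up ';'-separated cohort lists:

```python
import pandas as pd

# Hypothetical ';'-separated cohort lists, one row per paper.
df = pd.DataFrame({'DATASETS': ['RS; KORA', 'KORA;FHS', 'RS']})

# Split on ';', explode to one cohort per row, normalise, then count.
cohorts = (df['DATASETS'].str.split(';')
             .explode()
             .str.strip()
             .str.upper())
print(cohorts.value_counts())
```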

In [47]:
finaldataset = pd.read_csv(os.path.abspath(
    os.path.join('__file__', '../..', 'data',
                 'Support', 'Cohorts',
                 'manually_parsed_data.csv')),
    encoding='latin-1')
mylist = []
for index, row in finaldataset.iterrows():
    mylist = mylist + row['DATASETS'].split(';')
mylist = [x.strip().upper() for x in mylist]
df1 = pd.DataFrame({'Cohorts': mylist})
reader = csv.reader(open(os.path.abspath(os.path.join(
    '__file__', '../..', 'data', 'Support', 'Cohorts',
    'Dictionary_cohorts.csv')), 'r'))
d = {}
for row in reader:
    k, v = row
    d[k] = v
for key in d:
    df1['Cohorts'].replace(key, d[key], inplace=True)
df1['Cohorts'] = df1['Cohorts'].str.replace(
    'Rotterdam Study I (RS-I)', 'Rotterdam Study (RS)', regex=False)
df1['Cohorts'].replace('&', 'and', inplace=True)
df1['Cohorts'] = df1['Cohorts'].str.upper()
newdf = pd.DataFrame({'Count': df1.groupby(['Cohorts']).size()}).reset_index()

Then merge this with manually collated data:

In [48]:
manual_cohort_for_merge = pd.read_csv(os.path.abspath(
    os.path.join('__file__', "../..", "data",
                 "Support", "Cohorts",
                 "manual_cohort_for_merge.csv")),
    encoding='latin-1', index_col=False)
merged_manual = pd.merge(newdf, manual_cohort_for_merge,
                         how='left', on='Cohorts')
merged_manual.set_index('Cohorts').sort_values(
    by='Count', ascending=False).drop('NO NAME').head(10).style
merged_manual.set_index('Cohorts').sort_values(
    by='Count', ascending=False).drop('NO NAME').to_csv(os.path.abspath(
                                                        os.path.join('__file__',
                                                                     '../..',
                                                                     'tables',
                                                                     'ManualLY_Curated_Cohorts.csv')))

Let's now print out a list of the 30 most commonly used datasets, in case we need to reference any particularly prominent ones (e.g. deCODE, UK Biobank, etc.):

In [49]:
merged_manual=merged_manual.set_index('Cohorts').sort_values(by='Count', ascending=False)
holder='The 30 most commonly used datasets are:\n\n'
for index, row in merged_manual[0:30].iterrows():
    holder = holder + \
        str(index) + ' (' + str(row['Count']) + ' times)\n'
print(holder[:-1])
merged_manual[['Count']].to_csv(os.path.abspath(
                                os.path.join('__file__',
                                             '../..',
                                             'data',
                                             'Support',
                                             'Cohorts',
                                             'Cohort_Count.csv')), encoding='latin-1')
The 30 most commonly used datasets are:

ROTTERDAM STUDY (RS) (398 times)
NO NAME (297 times)
COOPERATIVE HEALTH RESEARCH IN THE REGION OF AUGSBURG (KORA) (255 times)
FRAMINGHAM HEART STUDY (FHS) (207 times)
ATHEROSCLEROSIS RISK IN COMMUNITIES STUDY (ARIC) (204 times)
CARDIOVASCULAR HEALTH STUDY (CHS) (179 times)
BRITISH 1958 BIRTH COHORT STUDY (B58C) (156 times)
UK ADULT TWIN REGISTER (TWINSUK) (140 times)
EUROPEAN PROSPECTIVE INVESTIGATION INTO CANCER (EPIC) (132 times)
NURSES HEALTH STUDY (NHS) (129 times)
STUDY OF HEALTH IN POMERANIA (SHIP) (127 times)
AGING GENE-ENVIRONMENT SUSCEPTIBILITY - REYKJAVIK STUDY (AGES) (119 times)
DECODE GENETICS (DECODE) (113 times)
ERASMUS RUCPHEN FAMILY STUDY (ERF) (104 times)
WELLCOME TRUST CASE CONTROL CONSORTIUM (WTCCC) (103 times)
MULTI-ETHNIC STUDY OF ATHEROSCLEROSIS (MESA) (98 times)
HEALTH, AGING, AND BODY COMPOSITION STUDY (HEALTH ABC) (92 times)
NORTHERN FINNISH BIRTH COHORT STUDY (NFBC) (87 times)
NETHERLANDS TWIN REGISTRY (NTR) (78 times)
ORKNEY COMPLEX DISEASE STUDY (ORCADES) (78 times)
LOTHIAN BIRTH COHORT (LBC) (78 times)
ESTONIAN GENOME CENTER OF THE UNIVERSITY OF TARTU (EGCUT) (75 times)
AVON LONGITUDINAL STUDY OF PARENTS AND CHILDREN (ALSPAC) (75 times)
COHORTE LAUSANNOISE (COLAUS) (75 times)
THE INVECCHIARE IN CHIANTI (INCHIANTI) (70 times)
WOMENS GENOME HEALTH STUDY (WGHS) (69 times)
HEALTH PROFESSIONAL FOLLOW-UP STUDY (HPFS) (69 times)
SARDINIA STUDY OF AGING (SARDINIA) (68 times)
BIOBANK JAPAN PROJECT (BBJ) (68 times)
CARDIOVASCULAR RISK IN YOUNG FINNS STUDY (YFS) (67 times)
In [50]:
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
PUBMED_by_N = pd.DataFrame(Cat_Ancestry.groupby(['PUBMEDID'])['N'].agg('sum').sort_values(ascending=False))
pubmed_manual = pd.merge(PUBMED_by_N.reset_index(), finaldataset, left_on='PUBMEDID',
                                 right_on='PUBMEDID', how='left')
print('We have counted ' + str(len(df1['Cohorts'])) + ' (non-unique) cohorts in total')
print('We have counted ' + str(len(newdf)) + ' unique cohorts in total.')
print("We're currently missing " +
      str(len(set(PUBMED_by_N[0:1000].index)-set(finaldataset['PUBMEDID']))) + 
     " of the biggest 1000 papers grouped by summed ancestry!")
print('We currently have coverage for ' + 
      str(round((pubmed_manual[pubmed_manual['In Charge'].notnull()]['N'].sum()/
          PUBMED_by_N['N'].sum())*100,2)) +
      '% of all papers by N.')
print('We currently have coverage for ' +
      str(round((len(finaldataset)/len(PUBMED_by_N))*100,2)) +
      '% of all papers in absolute terms.')
pubmed_manual.to_csv(os.path.abspath(
                             os.path.join('__file__',
                                          '../..',
                                          'data',
                                          'Support',
                                          'Cohorts',
                                          'Merged_Cohorts_Manual_Curation.csv')),
                     encoding='latin-1')
not_curated = pubmed_manual[~pubmed_manual['In Charge'].notnull()]['PUBMEDID']
not_curated.to_csv(os.path.abspath(os.path.join('__file__', '../..', 'data',
                                                'Support', 'Cohorts',
                                                'Outstanding_PUBMEDIDs_To_Curate.csv')),
                   encoding='latin-1')
We have counted 11851 (non-unique) cohorts in total
We have counted 2000 unique cohorts in total.
We're currently missing 47 of the biggest 1000 papers grouped by summed ancestry!
We currently have coverage for 86.01% of all papers by N.
We currently have coverage for 34.47% of all papers in absolute terms.